Configuring a robots.txt file for Candidate Portal

A robots.txt file tells search engine crawlers which pages and directories on a website they can crawl and index, so that those pages can appear in the results when somebody runs a search query. Search engine companies such as Google provide guidance on how to set up the robots.txt file to optimize your site's visibility to the search engine.

A robots.txt file can include instructions for crawlers to index specific pages or directories and ignore others. The instructions are not binding, so crawlers can disregard them. By default, Salesforce sites are set to disallow all crawlers except Google. If you want other search engines or crawlers to index Candidate Portal pages, such as public Vacancy pages and the Vacancy list, you can set up a new robots.txt file.

Search engine crawlers look for the robots.txt file in the site's root directory (for example, https://mycompany.my.salesforce-sites.com/). If you have multiple Sites in your Org to host the Candidate Portal, Agency Portal and other features, the URLs usually have a suffix (for example, https://mycompany.my.salesforce-sites.com/recruitment for your Candidate Portal). To place the robots.txt file at the root level, create a new Site without a suffix.
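To illustrate why the suffix matters, the following minimal Python sketch shows where a crawler looks for robots.txt when it is given a suffixed Site URL. The function name robots_url is hypothetical, and the URL is the example value used in this topic.

```python
from urllib.parse import urljoin


def robots_url(site_url: str) -> str:
    """Return the root-level robots.txt URL for the host of site_url."""
    # A leading slash makes urljoin discard any path (the Site suffix)
    # and resolve against the root of the host.
    return urljoin(site_url, "/robots.txt")


# Example Candidate Portal URL with the "recruitment" suffix.
print(robots_url("https://mycompany.my.salesforce-sites.com/recruitment"))
# prints: https://mycompany.my.salesforce-sites.com/robots.txt
```

The suffix is dropped, which is why the robots.txt file needs to be served by a Site that has no suffix.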

To set up a robots.txt file for a Site:

Go to Setup > Custom Code > Visualforce Pages and select New:

Sage People displays the Visualforce Page page.
Enter ApplyRobots as both the Label and the Name for the page.

In the Visualforce Markup tab, enter details as follows:

Line	Description
User-agent: *	The user-agent is the specific crawler you want to target. Entering an asterisk (*) targets all crawlers. A robots.txt file can include multiple sets of instructions targeting different robots.
Allow: /recruitment/fRecruit__ApplyJobList	Instruction to index the Vacancy list in Candidate Portal where the Site suffix is "recruitment".
Allow: /recruitment/fRecruit__Apply	Instruction to index the Vacancy pages in Candidate Portal where the Site suffix is "recruitment".
Disallow: /	Instruction to stop crawlers from indexing any page or directory on the website, including the root directory (/) and its subdirectories, except for the pages or directories listed in Allow instructions.

Select Save.
Go to Setup > User Interface > Sites and Domains > Sites and select Edit next to the Site Label.
Select the Lookup icon next to Site Robots.txt, find and select ApplyRobots, then select Save.
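
If you want to sanity-check the rules before relying on them, the following Python sketch parses the four lines from the table with the standard library's urllib.robotparser module and reports which paths a compliant crawler could fetch. It is an offline check only and assumes the example host name and the "recruitment" suffix used in this topic. The helper names and the /recruitment/SomeOtherPage path are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules from the table, in the same order. The "recruitment" suffix
# is the example value used in this topic; replace it with your own.
ROBOTS_TXT = """\
User-agent: *
Allow: /recruitment/fRecruit__ApplyJobList
Allow: /recruitment/fRecruit__Apply
Disallow: /
"""

# Example host name from this topic (an assumption for illustration).
SITE = "https://mycompany.my.salesforce-sites.com"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, path: str, user_agent: str = "*") -> bool:
    """Return True if the rules let user_agent fetch the given path."""
    return parser.can_fetch(user_agent, SITE + path)


parser = build_parser(ROBOTS_TXT)
for path in (
    "/recruitment/fRecruit__ApplyJobList",  # Vacancy list
    "/recruitment/fRecruit__Apply",         # Vacancy pages
    "/recruitment/SomeOtherPage",           # hypothetical page, not listed
    "/",                                    # site root
):
    print(path, is_allowed(parser, path))
```

The first two paths report True and the last two report False. Python's parser applies the first matching rule in file order, whereas crawlers such as Google's apply the most specific matching rule. With the Allow lines placed before Disallow: / as shown in the table, both approaches give the same result. Rules match by path prefix, so the Allow line for fRecruit__Apply also covers any other path that begins with that text.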